Chapter 4 Exploratory Data Analysis

4.1 Start with dplyr counts and summaries in console

  • David Robinson often starts exploring data with simple counts in the console. I recommend this as the first step in your EDA.

  • In the code below we don’t prefix functions with the package name in the console (so breaking the rule I just gave you). This is ok because we won’t save this code for others to read. It means we can type more quickly and explore the data with dplyr verbs faster.
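
  • As a minimal sketch of this kind of console exploration, the code below assumes the data is the txhousing data set that ships with ggplot2 (monthly Texas housing sales by city). The object names city_counts and city_summary are placeholders chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # loaded here only for the txhousing example data (an assumption)

# How many rows are there for each city?
city_counts <- txhousing %>% count(city)
city_counts

# Which years have the most rows?
txhousing %>% count(year, sort = TRUE)

# Group, then summarise: total sales, average sales and missing values per city
city_summary <- txhousing %>%
  group_by(city) %>%
  summarise(
    total_sales = sum(sales, na.rm = TRUE),
    mean_sales  = mean(sales, na.rm = TRUE),
    n_missing   = sum(is.na(sales))
  ) %>%
  arrange(desc(total_sales))
city_summary
```

  • Counting missing values at this stage is useful because they reappear later as warnings when the same column is plotted.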

4.2 Plot data points with geom_point()

  • After using dplyr count(), group_by() and summarise(), try plotting all the data points with ggplot2::geom_point(). It almost NEVER fails to show you what’s going on quickly and is unlikely to return a confusing error message.

  • ggplot2::geom_point() is the simplest and most reliable ggplot plot type (or geom) to start visualising with. Increasingly I find it is often a good choice for the final plot too.

  • Let’s look at all the values of sales for each date.
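
  • A minimal sketch of such a plot, assuming the data is ggplot2::txhousing with its date and sales columns. The object name p_date is a placeholder.

```r
library(ggplot2)

# One point per row: date on the x axis, sales on the y axis.
# Rows with a missing sales value are dropped with a warning.
p_date <- ggplot2::ggplot(ggplot2::txhousing, ggplot2::aes(x = date, y = sales)) +
  ggplot2::geom_point()
p_date
```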

## Warning: Removed 568 rows containing missing values (geom_point).

  • Now let’s look at the individual sales values for each city.
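
  • A sketch of the per-city view, again assuming ggplot2::txhousing. Putting city on the y axis is a choice made here so that the city names stay readable; the object name p_city is a placeholder.

```r
library(ggplot2)

# One point per monthly sales value, with one row of points per city
p_city <- ggplot2::ggplot(ggplot2::txhousing, ggplot2::aes(x = sales, y = city)) +
  ggplot2::geom_point()
p_city
```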

## Warning: Removed 568 rows containing missing values (geom_point).

  • Because there are so many points, they form very dark bands where they overlap. This is known as “overplotting”. Reduce overplotting by replacing geom_point() with geom_jitter(). This randomly “jitters” the location of the data points by a small amount so that they are less likely to overlap.

  • Sometimes there are so many data points that jittering does not reduce overplotting enough. Next, try making the dots lighter using a parameter called “alpha”, as in the code below. The lower the value of alpha, the fainter the data points.
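
  • A sketch combining both ideas, assuming ggplot2::txhousing. The alpha value of 0.1 is only a starting point to adjust by eye, and p_jitter is a placeholder name.

```r
library(ggplot2)

# geom_jitter() nudges each point by a small random amount;
# alpha = 0.1 makes each point faint, so dense areas show up darker
p_jitter <- ggplot2::ggplot(ggplot2::txhousing, ggplot2::aes(x = sales, y = city)) +
  ggplot2::geom_jitter(alpha = 0.1)
p_jitter
```

  • Because the jitter is random, the plot looks slightly different each time it is drawn.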

## Warning: Removed 568 rows containing missing values (geom_point).

  • Hadley Wickham has a few more tricks in the overplotting chapter of his ggplot2 book.

  • We all know sales of most things vary by the time of the year. So let’s now put date on the x axis, make city the colour, and because the data is over time we join the data points using ggplot2::geom_line().

  • We also use the reduced data set with fewer cities so that the plot is less crowded with fewer lines.
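
  • A sketch of the line plot, assuming ggplot2::txhousing. The four cities and the names cities_keep, df_red and p_line are placeholders chosen for illustration; they are not the selection used for the reduced data set in this chapter.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set with only a handful of cities
cities_keep <- c("Austin", "Dallas", "Houston", "San Antonio")
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% cities_keep)

# Date on the x axis, one coloured line per city
p_line <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = sales, colour = city)) +
  ggplot2::geom_line()
p_line
```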

## Warning: Removed 1 rows containing missing values (geom_path).

  • Beautiful. While sales have very different volumes between cities we can see they all tend to follow the same seasonal pattern. To bring the patterns of sales closer to each other so that they are easier to compare we can transform the sales value by taking its log. This is Hadley Wickham’s approach in ggplot2: Elegant Graphics for Data Analysis.

  • Wickham goes on to model the Texas housing sales data by fitting a linear model between the log of sales and the month, then plotting the residuals (i.e. the change in sales not explained by the month). This removes the strong seasonal effects. This is similar to the decomposition part of a classic time series analysis. Take a look at the recent fable forecasting package for an easy way to try this.

  • We will take a visual approach to reduce the seasonal effect in the Polish your final plot part of this chapter. The entire time series is plotted zoomed out with years clearly marked so that the strong monthly sales pattern that is repeated each year is more obvious.
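
  • The log transform and the month model described above can be sketched as follows, assuming ggplot2::txhousing. The city selection, the helper deseason() and the column rel_sales are hypothetical names, and treating month as a factor is an assumption about how the model is specified.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set
cities_keep <- c("Austin", "Dallas", "Houston", "San Antonio")
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% cities_keep)

# 1. Log transform: brings cities with very different volumes closer together
p_log <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = log(sales), colour = city)) +
  ggplot2::geom_line()
p_log

# 2. Residuals of a linear model of log(sales) on month.
#    na.exclude keeps the residuals aligned with rows that have missing sales.
deseason <- function(x, month) {
  stats::resid(stats::lm(x ~ factor(month), na.action = stats::na.exclude))
}

# Fit the model separately for each city and keep the residuals
df_resid <- df_red %>%
  dplyr::group_by(city) %>%
  dplyr::mutate(rel_sales = deseason(log(sales), month)) %>%
  dplyr::ungroup()

# What is left is the change in sales not explained by the month
p_resid <- ggplot2::ggplot(df_resid, ggplot2::aes(x = date, y = rel_sales, colour = city)) +
  ggplot2::geom_line()
p_resid
```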

## Warning: Removed 1 rows containing missing values (geom_path).

4.3 Facet by categories

  • So far we have shown the different sales patterns for each city by putting city into the “colour” argument of ggplot. However, with lots of cities the plot gets too crowded. In such situations facets or “small multiples” are a good choice. This is a fancy way of saying: draw a chart for each value in one or more columns, then look at all the charts at once, usually in a grid.

  • An important setting for facets is to specify scales = “free” so that each small plot has its own axis scales, fitted to the range of that city’s sales. This lets us more easily spot interesting differences in the patterns over time between plots.
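
  • A sketch of a faceted plot, assuming ggplot2::txhousing. The city selection and the names cities_keep, df_red and p_facet are placeholders.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set
cities_keep <- c("Austin", "Dallas", "Houston", "San Antonio")
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% cities_keep)

# One small plot per city; scales = "free" gives every panel its own axes
p_facet <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = sales)) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~ city, scales = "free")
p_facet
```

  • With free scales the shapes of the lines can be compared across panels, but the heights cannot, so keep an eye on the axis labels.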

## Warning: Removed 1 rows containing missing values (geom_path).

4.4 Facet interactively (trelliscopejs)

  • An interactive way to facet and explore your data with a GUI is trelliscopejs. Below we facet all the Texas cities in a trelliscope web page. Have a play with all the settings and see what they do.

4.5 Loop to plot every category separately

  • To study each city as a full single chart on its own we can loop through the cities and plot them automatically with very little code.

  • To do this we “nest” a dataframe for each city into one dataframe. Then loop through each of the nested city dataframes creating a plot for each one.

  • Here we use dplyr::group_by() on city then use tidyr::nest() to create the individual nested dataframes.

  • Piping the nested dataframe into View() shows us what a nested dataframe looks like.
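  • We can also view one of the nested data frames using square brackets. Think of the numbers in the square brackets like the co-ordinates in Excel. The first number is the column position and the second number is the row position of the data frame. That order holds for chained double brackets such as x[[2]][[1]]; with single brackets and a comma, as in x[1, 2], R expects the row first and then the column.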
  • Once you get used to square brackets it is a powerful way to navigate through nested R objects. The code above showed us the data in the first nested data frame of one city. The code below now drills down into the first column and first row of that single nested dataframe. This is just one cell so the most granular drill down possible for this object.
  • We can now add a plot to each nested data frame for each city. We use purrr::map2(). This is a compact way to loop through two arguments in a function. The function arguments being set by map2() are the nested dataframes in the data column and the city name inside the city column of the nested dataframe df_red_nest.
  • Take a look at the new nested data frame with a new column added containing a plot for each city.
  • Let’s also look at the information held for one of the plots, again using values in square brackets. The code below shows you that the plot is a series of nested lists that describe every element of the plot.
  • Finally, let’s print every plot quite simply with this code.
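
  • The nesting, drill-down and looping steps above can be sketched as follows, assuming ggplot2::txhousing. The city selection, the helper plot_city() and the name df_red_plots are hypothetical; df_red_nest follows the name used in the text.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced selection of cities
cities_keep <- c("Austin", "Dallas", "Houston", "San Antonio")

# One row per city, with that city's rows tucked into the list column `data`
df_red_nest <- ggplot2::txhousing %>%
  dplyr::filter(city %in% cities_keep) %>%
  dplyr::group_by(city) %>%
  tidyr::nest() %>%
  dplyr::ungroup()
df_red_nest

# [[column]][[row]]: column 2 is `data`, element 1 is the first city's data frame
df_red_nest[[2]][[1]]

# Drill down again: column 1, row 1 of that nested data frame (a single cell)
df_red_nest[[2]][[1]][[1]][[1]]

# Hypothetical helper: a line chart for one city, titled with the city name
plot_city <- function(df, city_name) {
  ggplot2::ggplot(df, ggplot2::aes(x = date, y = sales)) +
    ggplot2::geom_line() +
    ggplot2::labs(title = city_name)
}

# map2() walks along `data` and `city` together and returns a list of plots
df_red_plots <- df_red_nest %>%
  dplyr::mutate(plot = purrr::map2(data, city, plot_city))
df_red_plots

# Printing the list column draws every plot in turn
df_red_plots$plot
```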

[Output: the city plots print in turn as list elements [[1]] to [[14]]. Plot [[5]] also gives “Warning: Removed 1 rows containing missing values (geom_path).”]

4.6 Polish your final plot

  • We now have a bare minimum Exploratory Data Analysis toolkit. We explore the data with dplyr::count() from the console then visualise it with ggplot2::geom_point().

  • From exploring the data quickly we are soon ready to select a plot that tells an interesting story. But adding all the bells and whistles to make the final plot for a customer or publication can and does take a long time. So this polish shouldn’t be part of your exploratory data analysis.

  • Also, make sure the polishing is done with the clean Code style recommended earlier. You will find it quicker to comment out or tweak specific parts of your plot code until it looks just right. Clean code is faster to iterate.

  • The plot below isn’t perfect. There may be things you want to change depending both on what story you want to tell and your personal style.

  • How did I write this code? By Googling for what I wanted to do (e.g. “ggplot remove axis grid lines”), copying the code from a Stack Overflow answer, then pasting the code into a clear structure as below.

  • Many of the tweaks or polish will be to ggplot2::theme() or ggplot2::scale… but are you really going to remember the ggplot code you need for every adjustment you want to make? I no longer worry about remembering how to do it and just focus on how I want it to look and get absorbed in the creation and satisfaction of the plot gradually improving.

  • After you have built a few of your own publication quality plots with clear code you will soon be using your own plot code as a store of examples to re-use, meaning that you will Google less.

  • Be prepared for this code tweaking and plot polishing to take longer than you planned. Always.
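
  • As a sketch of what a clearly structured polishing block might look like, the code below assumes ggplot2::txhousing and a hypothetical selection of four cities. It is not the final plot of this chapter; the title, labels and theme settings are illustrative choices, and p_final is a placeholder name.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set
cities_keep <- c("Austin", "Dallas", "Houston", "San Antonio")
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% cities_keep)

# One tweak per line, so each can be commented out or adjusted on its own
p_final <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = sales, colour = city)) +
  ggplot2::geom_line() +
  # txhousing stores date as a decimal year, so a continuous scale is used;
  # a break every two years helps mark the repeating yearly pattern
  ggplot2::scale_x_continuous(breaks = seq(2000, 2016, by = 2)) +
  ggplot2::labs(
    title  = "Monthly housing sales in four Texas cities",
    x      = NULL,
    y      = "Sales",
    colour = "City"
  ) +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    panel.grid.minor   = ggplot2::element_blank(),  # remove minor grid lines
    panel.grid.major.x = ggplot2::element_blank(),  # remove vertical grid lines
    legend.position    = "bottom"
  )
p_final
```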

## label_key: city
## Saving 7 x 5 in image
## Warning: package 'gdtools' was built under R version 3.6.1
## Warning: Removed 430 rows containing missing values (geom_path).